Association Rule Mining is used when you want to find associations between different objects in a set, or frequent patterns in an information repository. It generates a set of rules called Association Rules. In simple words, the output is rules of the form "if this, then that".
Consider the following example:
We have a set of transaction data, numbered 1 to 5. Each transaction shows the items bought in that transaction. You can see that Diaper is bought with Beer in three transactions. Similarly, Bread is bought with Milk in three transactions, making both pairs frequent item-sets. Association rules are given in the form below:
The part before => is referred to as if (Antecedent) and the part after => is referred to as then (Consequent).
Where A and B are sets of items in the transaction data. A and B are disjoint sets.
For example, if we have the rule Computer => Anti-virus Software with support 20% and confidence 60%, it means:
20% of transactions show Anti-virus software bought together with a Computer;
60% of customers who purchase a Computer also buy Anti-virus software.
Basic concepts of Association Rule Mining:
Itemset: Collection of one or more items. K-item-set means a set of k items.
Support Count: Frequency of occurrence of an item-set.
Support (s): Fraction of transactions that contain the item-set ‘X’.
Confidence (c): For a rule A=>B, Confidence is the percentage of transactions containing A that also contain B: the number of transactions with both A and B divided by the number of transactions with A.
Lift: Lift gives the correlation between A and B in the rule A=>B. Correlation shows how the item-set A affects the item-set B.
If a rule has a lift of 1, then A and B are independent and no rule can be derived from them.
If the lift is > 1, then A and B are dependent on each other, and the degree of dependence is given by the lift value.
If the lift is < 1, then the presence of A has a negative effect on B.
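These three measures can be computed directly. Below is a minimal base-R sketch on the toy Diaper/Beer transactions from the example above (item names are illustrative, not the smells dataset):

```r
# Toy transactions from the example above; not the smells dataset.
transactions <- list(
  c("Bread", "Milk"),
  c("Bread", "Diaper", "Beer", "Eggs"),
  c("Milk", "Diaper", "Beer", "Cola"),
  c("Bread", "Milk", "Diaper", "Beer"),
  c("Bread", "Milk", "Diaper", "Cola")
)
n <- length(transactions)

# A transaction "contains" an itemset if every item of the set is in it.
contains <- function(items)
  sapply(transactions, function(t) all(items %in% t))

support    <- function(items)    sum(contains(items)) / n
confidence <- function(lhs, rhs) support(c(lhs, rhs)) / support(lhs)
lift       <- function(lhs, rhs) confidence(lhs, rhs) / support(rhs)

support(c("Diaper", "Beer"))   # 0.6: {Diaper, Beer} occurs in 3 of 5 rows
confidence("Diaper", "Beer")   # 0.75: 3 of the 4 Diaper rows contain Beer
lift("Diaper", "Beer")         # 1.25: > 1, positive association
```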
Association Rule Mining is viewed as a two-step approach:
Frequent Itemset Generation: find all frequent item-sets with support >= a pre-determined min_support count.
Rule Generation: list all Association Rules derived from the frequent item-sets, and calculate Support and Confidence for all rules.
Frequent Itemset Generation is the most computationally expensive step because it requires repeated scans of the full database.
For this, the APRIORI algorithm is used. It states:
“Any subset of a frequent itemset must also be frequent. In other words, no superset of an infrequent itemset should be generated or tested.”
The following figure shows how much APRIORI helps to reduce the number of sets to be generated:
If item-set {a,b} is infrequent then we do not need to take into account all its super-sets.
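The pruning rule can be sketched in a few lines of base R: a candidate k-itemset is kept only if all of its (k-1)-subsets are frequent. This is an illustrative helper, not the arules implementation:

```r
# Keep a candidate only if every (k-1)-subset appears in the frequent list.
prune_candidates <- function(candidates, frequent_km1) {
  keep <- sapply(candidates, function(cand) {
    subsets <- combn(cand, length(cand) - 1, simplify = FALSE)
    all(sapply(subsets, function(s)
      any(sapply(frequent_km1, function(f) setequal(f, s)))))
  })
  candidates[keep]
}

frequent_2   <- list(c("a", "b"), c("b", "c"), c("a", "c"))
candidates_3 <- list(c("a", "b", "c"), c("a", "b", "d"))
prune_candidates(candidates_3, frequent_2)
# keeps {a,b,c}; prunes {a,b,d}, since {a,d} and {b,d} are not frequent
```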
Let’s understand this with an example. In the following example, you will see why APRIORI is an effective algorithm and how strong association rules are generated step by step.
Here APRIORI plays its role and helps reduce the size of the candidate list, so that only useful rules are generated at the end. In the following steps, you will see how we reach the end of Frequent Itemset Generation, the first step of Association Rule Mining.
The next step is to generate rules from the frequent itemsets. Take the last non-empty Frequent Itemset list, which in this example is L2 = {I1, I2}, {I2, I3}. Then form all non-empty subsets of the item-sets present in that list. Follow along as shown in the illustration below:
You can see above that there are four strong rules. For example, take I2=>I3 with confidence equal to 75%: it tells us that 75% of the transactions containing I2 also contain I3.
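The subset enumeration behind rule generation can be sketched in base R. The itemset below is taken from the worked example; treat the snippet as illustrative:

```r
# For a frequent itemset, every non-empty proper subset is a candidate LHS;
# the remaining items form the RHS.
itemset <- c("I2", "I3")
lhs_list <- unlist(
  lapply(seq_len(length(itemset) - 1),
         function(k) combn(itemset, k, simplify = FALSE)),
  recursive = FALSE)
for (lhs in lhs_list) {
  rhs <- setdiff(itemset, lhs)
  cat(paste(lhs, collapse = ","), "=>", paste(rhs, collapse = ","), "\n")
}
# I2 => I3
# I3 => I2
```

Each candidate rule would then be kept only if its confidence passes the min_confidence threshold.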
Let’s get on to the code.
We will use the dataset df_join, which contains the design and test smells associated with classes of several projects, across multiple releases.
NameTag: the name of the project's version.
HashCommit: the hash of the commit associated with that release.
Date: the date on which the release was committed.
Project Name: the name of the project.
Package Name: the package that contains the class having design/test smells.
Type Name: the class having design/test smells.
Cause of the smell: a description of the cause of that design smell.
ClassName: the concatenation of Package Name and Type Name.
Test suite: the test class associated with the production code.
Test smells: which of the test smells ar1, et1, it1, gf1, se1, mg1, ro1 are present in that test class.
Smells: the union of the design and test smells affecting that class.
#install and load package arules
#install.packages("arules")
library(arules)
## Loading required package: Matrix
##
## Attaching package: 'arules'
## The following objects are masked from 'package:base':
##
## abbreviate, write
#install and load arulesViz
#install.packages("arulesViz")
library(arulesViz)
#install and load tidyverse
#install.packages("tidyverse")
library(tidyverse)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.3 v purrr 0.3.4
## v tibble 3.1.2 v dplyr 1.0.6
## v tidyr 1.1.3 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x tidyr::pack() masks Matrix::pack()
## x dplyr::recode() masks arules::recode()
## x tidyr::unpack() masks Matrix::unpack()
#install and load readxl
#install.packages("readxl")
library(readxl)
#install and load knitr
#install.packages("knitr")
library(knitr)
#load ggplot2 as it comes in tidyverse
library(ggplot2)
#install and load lubridate
#install.packages("lubridate")
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:arules':
##
## intersect, setdiff, union
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
#install and load plyr
#install.packages("plyr")
library(plyr)
## ------------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## ------------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## The following object is masked from 'package:purrr':
##
## compact
library(dplyr)
Use read.csv(path to file) to read the dataset df_join.
df_join <- read.csv("df_join.csv")
df_join
Before applying Association Rule Mining, we need to convert the dataframe into transaction data, so that all smells that appear together in one class are on one row.
We can group the smells by ClassName, apply a function to each group, and store the output in another dataframe. This can be done with ddply.
The following lines of code will combine all sets of smells from one ClassName into one row, with the smells separated by commas.
library(plyr)
#ddply(dataframe, variables_to_be_used_to_split_data_frame, function_to_be_applied)
transactionData <- ddply(df_join,c("ClassName"),
function(df_join)paste(df_join$Smells,
collapse = ","))
#The R function paste() concatenates vectors into a character string, separating the results with collapse = [any optional character string]. Here ',' is used
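The same grouping can also be done in base R with aggregate. A minimal sketch on a hypothetical two-column frame (the real df_join has more columns):

```r
# Hypothetical miniature of df_join: one row per (class, smell) pair.
df <- data.frame(
  ClassName = c("pkg.A", "pkg.A", "pkg.B"),
  Smells    = c("ar1", "et1", "gf1")
)
# aggregate() passes extra arguments (collapse) on to paste().
out <- aggregate(Smells ~ ClassName, data = df, FUN = paste, collapse = ",")
out$Smells
# "ar1,et1" "gf1"
```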
Next, as ClassName will not be of any use in the rule mining, we can set it to NULL.
transactionData$ClassName <- NULL
This format for transaction data is called the basket format. Next, you have to store this transaction data into a .csv (Comma Separated Values) file; for this, use write.csv():
write.csv(transactionData,"transactions.csv", quote = FALSE, row.names = FALSE)
We load this transaction data into an object of the transaction class. This is done by using the R function read.transactions of the arules package.
The following line of code will take transaction data file transactions.csv which is in basket format and convert it into an object of the transaction class.
tr <- read.transactions('transactions.csv', format = 'basket', sep=',')
## Warning in asMethod(object): removing duplicated items in transactions
tr
## transactions in sparse format with
## 1269 transactions (rows) and
## 25 items (columns)
summary(tr)
## transactions as itemMatrix in sparse format with
## 1269 rows (elements/itemsets/transactions) and
## 25 columns (items) and a density of 0.1374626
##
## most frequent items:
## ar1 et1
## 1023 998
## Deficient Encapsulation Insufficient Modularization
## 369 336
## Unutilized Abstraction (Other)
## 332 1303
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4 5 6 7 8 9
## 61 183 454 372 135 47 11 4 2
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 3.000 3.437 4.000 9.000
##
## includes extended item information - examples:
## labels
## 1 ar1
## 2 Broken Hierarchy
## 3 Broken Modularization
The summary(tr) is a very useful command that gives us information about our transaction object. Let’s take a look at what the above output says:
There are 1269 transactions (rows) and 25 items (columns). Note that the 25 items are the distinct smells involved in the dataset, and each of the 1269 transactions is the collection of smells detected in one class.
Density tells the fraction of non-zero cells in the sparse matrix: the total number of detected smells divided by the total number of cells in the matrix. You can recover the total number of smell occurrences using the density: 1269 x 25 x 0.1374626 ≈ 4361.
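The arithmetic can be checked directly with the numbers reported by summary(tr); a quick sketch:

```r
# Density times the matrix size gives the total number of smell occurrences.
n_transactions <- 1269
n_items        <- 25
density        <- 0.1374626
round(n_transactions * n_items * density)   # about 4361 occurrences
```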
You can generate an itemFrequencyPlot to create an item Frequency Bar Plot to view the distribution of items.
# Create an item frequency plot for the top 20 items
if (!require("RColorBrewer")) {
# install color package of R
install.packages("RColorBrewer")
#include library RColorBrewer
library(RColorBrewer)
}
## Loading required package: RColorBrewer
itemFrequencyPlot(tr,topN=20,type="absolute",col=brewer.pal(8,'Pastel2'), main="Absolute Item Frequency Plot")
In itemFrequencyPlot(tr,topN=20,type="absolute") the first argument is the transaction object to be plotted, tr. topN allows you to plot the N highest-frequency items. type can be type="absolute" or type="relative". If absolute, it will plot the raw frequency of each item; if relative, it will plot each item's frequency as a fraction of the total number of transactions, so items can be compared with each other.
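The frequencies behind the plot can also be read numerically with itemFrequency() from arules. A self-contained sketch on toy transactions (item names are illustrative):

```r
library(arules)

# Three toy "classes", each a set of smells.
toy <- as(list(c("ar1", "et1"), c("ar1", "gf1"), c("et1")), "transactions")

itemFrequency(toy, type = "absolute")   # counts: ar1 = 2, et1 = 2, gf1 = 1
itemFrequency(toy, type = "relative")   # fractions of the 3 transactions
```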
itemFrequencyPlot(tr,topN=20,type="relative",col=brewer.pal(8,'Pastel2'),main="Relative Item Frequency Plot")
Next step is to mine the rules using the APRIORI algorithm. The function apriori() is from package arules.
design_smells = c('Deficient Encapsulation', 'Unutilized Abstraction', 'Feature Envy', 'Broken Hierarchy', 'Broken Modularization', 'Insufficient Modularization', 'Wide Hierarchy', 'Unnecessary Abstraction', 'Multifaceted Abstraction', 'Cyclically-dependent Modularization','Cyclic Hierarchy', 'Rebellious Hierarchy')
test_smells = c("ar1", "et1", "it1", "gf1", "se1", "mg1", "ro1")
# Min Support as 0.001, confidence as 0.8.
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, minlen=2, maxlen=2),
appearance = list(lhs=design_smells, rhs=test_smells))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[19 item(s)] done [0.00s].
## set transactions ...[19 item(s), 1269 transaction(s)] done [0.00s].
## sorting and recoding items ... [18 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8, minlen = 2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!
## done [0.00s].
## writing ... [9 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
apriori() takes tr as the transaction object on which mining is to be applied. parameter allows you to set min_sup and min_confidence; the defaults are a minimum support of 0.1 and a minimum confidence of 0.8. Here we use supp=0.001 and conf=0.8, and restrict rules to exactly 2 items (minlen=2, maxlen=2). With appearance we restrict the LHS (the IF part) to the design_smells array and the RHS (the THEN part) to the test_smells array.
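The effect of the appearance constraint can be seen on a toy example. This is a sketch, self-contained apart from arules; the item names are illustrative:

```r
library(arules)

toy <- as(list(c("design1", "test1"),
               c("design1", "test1"),
               c("design2")), "transactions")

# Same call shape as above: design items on the LHS, test items on the RHS.
rules <- apriori(toy,
                 parameter  = list(supp = 0.1, conf = 0.8,
                                   minlen = 2, maxlen = 2),
                 appearance = list(lhs = "design1", rhs = "test1"),
                 control    = list(verbose = FALSE))
inspect(rules)   # only design1 => test1 is allowed to appear
```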
summary(association.rules)
## set of 9 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 9
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01024 Min. :0.8125 Min. :0.01261 Min. :1.033
## 1st Qu.:0.05910 1st Qu.:0.8542 1st Qu.:0.06777 1st Qu.:1.072
## Median :0.14657 Median :0.8721 Median :0.15997 Median :1.082
## Mean :0.14097 Mean :0.8910 Mean :0.15936 Mean :1.121
## 3rd Qu.:0.24586 3rd Qu.:0.9163 3rd Qu.:0.26478 3rd Qu.:1.165
## Max. :0.25374 Max. :1.0000 Max. :0.29078 Max. :1.240
## count
## Min. : 13.0
## 1st Qu.: 75.0
## Median :186.0
## Mean :178.9
## 3rd Qu.:312.0
## Max. :322.0
##
## mining info:
## data ntransactions support confidence
## tr 1269 0.001 0.8
summary(association.rules) shows the following:
Parameter specification: min_sup=0.001 and min_confidence=0.8, with a minimum and a maximum of 2 items per rule.
Total number of rules: a set of 9 rules.
Summary of quality measures: min and max values for Support, Confidence, Coverage and Lift.
Mining info: the data, support, and confidence we provided to the algorithm.
Since there are only 9 rules, we do not need to filter down to a top 10; we can inspect all of them:
inspect(association.rules)
## lhs rhs support confidence
## [1] {Cyclic Hierarchy} => {et1} 0.01024429 0.8125000
## [2] {Multifaceted Abstraction} => {ar1} 0.01497242 1.0000000
## [3] {Feature Envy} => {et1} 0.06067770 0.8953488
## [4] {Feature Envy} => {ar1} 0.05910165 0.8720930
## [5] {Cyclically-dependent Modularization} => {et1} 0.14657210 0.9162562
## [6] {Insufficient Modularization} => {et1} 0.25374310 0.9583333
## [7] {Insufficient Modularization} => {ar1} 0.22616233 0.8541667
## [8] {Deficient Encapsulation} => {et1} 0.24586288 0.8455285
## [9] {Deficient Encapsulation} => {ar1} 0.25137904 0.8644986
## coverage lift count
## [1] 0.01260835 1.033129 13
## [2] 0.01497242 1.240469 19
## [3] 0.06776990 1.138475 77
## [4] 0.06776990 1.081805 75
## [5] 0.15996848 1.165059 186
## [6] 0.26477541 1.218562 322
## [7] 0.26477541 1.059567 287
## [8] 0.29078014 1.075126 312
## [9] 0.29078014 1.072384 319
#inspect(association.rules[1:10])
Using the above output, you can make analyses such as:
Classes which have design smell ‘Cyclic Hierarchy’ will have the test smell ‘et1’ with support of 0.01 and confidence of 0.81.
Classes which have design smell ‘Multifaceted Abstraction’ will have the test smell ‘ar1’ with support of 0.0149 and confidence of 1.
Classes which have design smell ‘Feature Envy’ will have the test smell ‘et1’ with support of 0.06 and confidence of 0.89.
Classes which have design smell ‘Feature Envy’ will have the test smell ‘ar1’ with support of 0.059 and confidence of 0.87.
Since hundreds or thousands of rules can be generated from the data, you need a few ways to visualize these association rules.
A straightforward visualization of association rules is a scatter plot drawn with plot() from the arulesViz package. It uses Support and Confidence on the axes; in addition, a third measure, Lift, is used by default for the color (grey levels) of the points.
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
The above plot shows that rules with high lift have low support. You can use the following options for the plot:
plot(subRules,method="two-key plot")
The two-key plot uses support and confidence on the x- and y-axes respectively, and uses order for coloring, where order is the number of items in the rule.
Graph-Based Visualizations
Graph-based techniques visualize association rules using vertices and edges where vertices are labeled with item names, and item sets or rules are represented as a second set of vertices. Items are connected with item-sets/rules using directed arrows. Arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS. The size and color of vertices often represent interest measures.
Graph plots are a great way to visualize rules but tend to become congested as the number of rules increases, so it is better to visualize a smaller number of rules with graph-based visualizations.
Let’s select 9 rules from subRules having the highest confidence.
top10subRules <- head(subRules, n = 9, by = "confidence")
Now, plot an interactive graph:
Note: you can make all your plots interactive using the engine = "htmlwidget" parameter in plot().
plot(top10subRules, method = "graph", engine = "htmlwidget")
The following representation is called a Parallel Coordinates Plot. The RHS is the Consequent; the positions on the left are the LHS items, where position 2 is the most recent addition to the itemset and position 1 is the item we already had.
# Filter top 9 rules with highest lift
subRules2<-head(subRules, n=9, by="lift")
plot(subRules2, method="paracoord")
It shows that when a class has ‘Multifaceted Abstraction’, with high probability it will also have ‘ar1’.
The next step is to run the APRIORI algorithm on design-design smell pairs:
# Min Support as 0.001, confidence as 0.8.
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, minlen=2, maxlen=4), appearance = list(both=design_smells))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 4 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[12 item(s)] done [0.00s].
## set transactions ...[12 item(s), 1269 transaction(s)] done [0.00s].
## sorting and recoding items ... [11 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 4
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8, minlen = 2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 4 returned!
## done [0.00s].
## writing ... [7 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
apriori() takes tr as the transaction object on which mining is to be applied. parameter allows you to set min_sup and min_confidence; the defaults are a minimum support of 0.1 and a minimum confidence of 0.8. Here we use supp=0.001 and conf=0.8, a minimum of 2 items (minlen=2) and a maximum of 4 items (maxlen=4), because with maxlen=2 there were 0 rules. With appearance we set the both parameter to the design_smells array, so only design smells appear in the rules.
summary(association.rules)
## set of 7 rules
##
## rule length distribution (lhs + rhs):sizes
## 3 4
## 2 5
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 3.500 4.000 3.714 4.000 4.000
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.001576 Min. :0.8000 Min. :0.001576 Min. : 3.021
## 1st Qu.:0.001970 1st Qu.:1.0000 1st Qu.:0.001970 1st Qu.: 3.777
## Median :0.002364 Median :1.0000 Median :0.002364 Median : 3.822
## Mean :0.002477 Mean :0.9714 Mean :0.002589 Mean : 7.006
## 3rd Qu.:0.002758 3rd Qu.:1.0000 3rd Qu.:0.003152 3rd Qu.: 3.822
## Max. :0.003940 Max. :1.0000 Max. :0.003940 Max. :27.000
## count
## Min. :2.000
## 1st Qu.:2.500
## Median :3.000
## Mean :3.143
## 3rd Qu.:3.500
## Max. :5.000
##
## mining info:
## data ntransactions support confidence
## tr 1269 0.001 0.8
We obtained 7 rules, which are shown here:
inspect(association.rules)
## lhs rhs support confidence coverage lift count
## [1] {Broken Modularization,
## Unnecessary Abstraction} => {Unutilized Abstraction} 0.003940110 1.0 0.003940110 3.822289 5
## [2] {Feature Envy,
## Unnecessary Abstraction} => {Unutilized Abstraction} 0.002364066 1.0 0.002364066 3.822289 3
## [3] {Broken Modularization,
## Deficient Encapsulation,
## Unnecessary Abstraction} => {Unutilized Abstraction} 0.002364066 1.0 0.002364066 3.822289 3
## [4] {Broken Modularization,
## Deficient Encapsulation,
## Unutilized Abstraction} => {Unnecessary Abstraction} 0.002364066 1.0 0.002364066 27.000000 3
## [5] {Cyclic Hierarchy,
## Cyclically-dependent Modularization,
## Deficient Encapsulation} => {Insufficient Modularization} 0.001576044 1.0 0.001576044 3.776786 2
## [6] {Cyclically-dependent Modularization,
## Deficient Encapsulation,
## Feature Envy} => {Insufficient Modularization} 0.003152088 0.8 0.003940110 3.021429 4
## [7] {Broken Hierarchy,
## Cyclically-dependent Modularization,
## Deficient Encapsulation} => {Insufficient Modularization} 0.001576044 1.0 0.001576044 3.776786 2
#inspect(association.rules[1:10])
Using the above output, you can make analyses such as:
Classes which have design smells ‘Broken Modularization’ and ‘Unnecessary Abstraction’ will have the design smell ‘Unutilized Abstraction’ with support of 0.0039 and confidence of 1.
Classes which have design smell ‘Feature Envy’ and ‘Unnecessary Abstraction’ will have the design smell ‘Unutilized Abstraction’ with support of 0.0023 and confidence of 1.
Since hundreds or thousands of rules can be generated from the data, you need a few ways to visualize these association rules.
A straightforward visualization of association rules is a scatter plot drawn with plot() from the arulesViz package. It uses Support and Confidence on the axes; in addition, a third measure, Lift, is used by default for the color (grey levels) of the points.
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The above plot shows that rules with high lift have low support. You can use the following options for the plot:
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The two-key plot uses support and confidence on the x- and y-axes respectively, and uses order for coloring, where order is the number of items in the rule.
Graph-Based Visualizations
Graph-based techniques visualize association rules using vertices and edges where vertices are labeled with item names, and item sets or rules are represented as a second set of vertices. Items are connected with item-sets/rules using directed arrows. Arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS. The size and color of vertices often represent interest measures.
Graph plots are a great way to visualize rules but tend to become congested as the number of rules increases, so it is better to visualize a smaller number of rules with graph-based visualizations.
Let’s select 7 rules from subRules having the highest confidence.
top10subRules <- head(subRules, n = 7, by = "confidence")
Now, plot an interactive graph:
Note: you can make all your plots interactive using the engine = "htmlwidget" parameter in plot().
plot(top10subRules, method = "graph", engine = "htmlwidget")
The following representation is called a Parallel Coordinates Plot. The RHS is the Consequent; the positions on the left are the LHS items, where position 2 is the most recent addition to the itemset and position 1 is the item we already had.
# Filter top 7 rules with highest lift
subRules2<-head(subRules, n=7, by="lift")
plot(subRules2, method="paracoord")
It shows that when a class has ‘Broken Modularization’, ‘Deficient Encapsulation’ and ‘Unutilized Abstraction’, with high probability it will also have ‘Unnecessary Abstraction’.
The next step is to run the APRIORI algorithm on test-test smell pairs:
# Min Support as 0.001, confidence as 0.8.
association.rules <- apriori(tr, parameter = list(supp=0.001, conf=0.8, minlen=2, maxlen=2), appearance = list(both=test_smells))
## Apriori
##
## Parameter specification:
## confidence minval smax arem aval originalSupport maxtime support minlen
## 0.8 0.1 1 none FALSE TRUE 5 0.001 2
## maxlen target ext
## 2 rules TRUE
##
## Algorithmic control:
## filter tree heap memopt load sort verbose
## 0.1 TRUE TRUE FALSE TRUE 2 TRUE
##
## Absolute minimum support count: 1
##
## set item appearances ...[7 item(s)] done [0.00s].
## set transactions ...[7 item(s), 1269 transaction(s)] done [0.00s].
## sorting and recoding items ... [7 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2
## Warning in apriori(tr, parameter = list(supp = 0.001, conf = 0.8, minlen = 2, :
## Mining stopped (maxlen reached). Only patterns up to a length of 2 returned!
## done [0.00s].
## writing ... [13 rule(s)] done [0.00s].
## creating S4 object ... done [0.00s].
apriori() takes tr as the transaction object on which mining is to be applied. parameter allows you to set min_sup and min_confidence; the defaults are a minimum support of 0.1 and a minimum confidence of 0.8. Here we use supp=0.001 and conf=0.8, and restrict rules to exactly 2 items (minlen=2, maxlen=2). With appearance we set the both parameter to the test_smells array, so only test smells appear in the rules.
summary(association.rules)
## set of 13 rules
##
## rule length distribution (lhs + rhs):sizes
## 2
## 13
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2 2 2 2 2 2
##
## summary of quality measures:
## support confidence coverage lift
## Min. :0.01261 Min. :0.8000 Min. :0.01576 Min. :1.049
## 1st Qu.:0.07171 1st Qu.:0.8585 1st Qu.:0.08353 1st Qu.:1.095
## Median :0.08983 Median :0.8880 Median :0.09850 Median :1.116
## Mean :0.16973 Mean :0.8765 Mean :0.19840 Mean :1.760
## 3rd Qu.:0.15603 3rd Qu.:0.9000 3rd Qu.:0.18125 3rd Qu.:1.144
## Max. :0.66509 Max. :0.9185 Max. :0.80615 Max. :9.577
## count
## Min. : 16.0
## 1st Qu.: 91.0
## Median :114.0
## Mean :215.4
## 3rd Qu.:198.0
## Max. :844.0
##
## mining info:
## data ntransactions support confidence
## tr 1269 0.001 0.8
Of the 13 rules, we inspect the first 10 here:
#inspect(association.rules)
inspect(association.rules[1:10])
## lhs rhs support confidence coverage lift count
## [1] {ro1} => {mg1} 0.01260835 0.8000000 0.01576044 9.577358 16
## [2] {ro1} => {et1} 0.01418440 0.9000000 0.01576044 1.144389 18
## [3] {ro1} => {ar1} 0.01418440 0.9000000 0.01576044 1.116422 18
## [4] {mg1} => {et1} 0.07407407 0.8867925 0.08353034 1.127595 94
## [5] {mg1} => {ar1} 0.07171001 0.8584906 0.08353034 1.064931 91
## [6] {it1} => {et1} 0.08983452 0.9120000 0.09850276 1.159647 114
## [7] {it1} => {ar1} 0.08747045 0.8880000 0.09850276 1.101537 111
## [8] {se1} => {et1} 0.09771474 0.9185185 0.10638298 1.167936 124
## [9] {se1} => {ar1} 0.09613869 0.9037037 0.10638298 1.121017 122
## [10] {gf1} => {et1} 0.15602837 0.8608696 0.18124507 1.094633 198
Using the above output, you can make analyses such as:
Classes which have test smell ‘ro1’ will have the test smell ‘mg1’ with support of 0.0126 and confidence of 0.8.
Classes which have test smell ‘ro1’ will have the test smell ‘et1’ with support of 0.0141 and confidence of 0.9.
Since hundreds or thousands of rules can be generated from the data, you need a few ways to visualize these association rules.
A straightforward visualization of association rules is a scatter plot drawn with plot() from the arulesViz package. It uses Support and Confidence on the axes; in addition, a third measure, Lift, is used by default for the color (grey levels) of the points.
# Filter rules with confidence greater than 0.4 or 40%
subRules<-association.rules[quality(association.rules)$confidence>0.4]
#Plot SubRules
plot(subRules)
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The above plot shows that rules with high lift have low support. You can use the following options for the plot:
plot(subRules,method="two-key plot")
## To reduce overplotting, jitter is added! Use jitter = 0 to prevent jitter.
The two-key plot uses support and confidence on the x- and y-axes respectively, and uses order for coloring, where order is the number of items in the rule.
Graph-Based Visualizations
Graph-based techniques visualize association rules using vertices and edges where vertices are labeled with item names, and item sets or rules are represented as a second set of vertices. Items are connected with item-sets/rules using directed arrows. Arrows pointing from items to rule vertices indicate LHS items and an arrow from a rule to an item indicates the RHS. The size and color of vertices often represent interest measures.
Graph plots are a great way to visualize rules but tend to become congested as the number of rules increases, so it is better to visualize a smaller number of rules with graph-based visualizations.
Let’s select 10 rules from subRules having the highest confidence.
top10subRules <- head(subRules, n = 10, by = "confidence")
Now, plot an interactive graph:
Note: you can make all your plots interactive using the engine = "htmlwidget" parameter in plot().
plot(top10subRules, method = "graph", engine = "htmlwidget")
The following representation is called a Parallel Coordinates Plot. The RHS is the Consequent; the positions on the left are the LHS items, where position 2 is the most recent addition to the itemset and position 1 is the item we already had.
# Filter top 20 rules with highest lift
subRules2<-head(subRules, n=20, by="lift")
plot(subRules2, method="paracoord")
It shows that when a class has ‘ro1’, with high probability it will also have ‘mg1’.